This report explores the pricing dynamics of Airbnb listings in Sydney, using machine learning classification models to predict property price categories (Budget, under $100; MidMarket, $100-$200; Premium, over $200) from property characteristics, location data and host information. Our analysis addresses key Australian housing market challenges while providing actionable insights for the tourism and rental property sectors. We begin with detailed data cleaning, including formatting, handling missing values and addressing outliers. Exploratory data analysis (EDA) highlights geographic clustering of premium properties around Sydney Harbour and the CBD, while budget options are spread towards the outer suburbs. Overall, this study demonstrates how data-driven classification can uncover meaningful patterns in Airbnb pricing, supporting more informed decision-making across the platform’s ecosystem.
Can we predict whether a Sydney Airbnb property will be classified as Premium (>$200/night), MidMarket ($100-200/night), or Budget (<$100/night) based on property characteristics, location and host factors?
This project focuses on a multi-class classification problem with three distinct target categories:
The classification approach enables predictive insights for property investors, market segmentation analysis for tourism planning, and pricing strategy guidance for potential hosts.
While property prices exist on a continuous scale, converting them into discrete market segments provides substantial practical and strategic value for multiple stakeholders:
1. Consumer Decision-Making and Search Behavior
Travelers typically approach accommodation search with a budget category in mind rather than exact price points. The three-tier classification reflects natural consumer behavior patterns where users mentally categorize options as “budget-friendly,” “mid-range,” or “luxury” before drilling down into specific listings. This categorization mirrors common filtering mechanisms on booking platforms.
2. Investment and Portfolio Strategy
Property investors require clear market positioning to guide acquisition and renovation decisions. A determination of whether a property will command Budget, MidMarket, or Premium rates directly informs: - Renovation budget allocation and expected ROI - Target demographic and marketing positioning - Competitive positioning within specific neighbourhoods - Risk assessment for new property investments
3. Regulatory and Policy Applications
Australian housing policy and short-term rental regulations often distinguish between different accommodation tiers. Premium properties may face different regulatory scrutiny regarding their impact on long-term housing availability compared to budget options. Classification models can inform evidence-based policy decisions about short-term rental impacts on housing affordability.
4. Market Segmentation and Pricing Strategy
Hosts benefit from understanding which category their property naturally falls into based on structural features, location, and amenities. Rather than marginally adjusting a continuous price, hosts can make strategic decisions about whether feature upgrades would move their property into a higher tier, fundamentally changing their market position and revenue potential.
5. Tourism Planning and Economic Analysis
Sydney’s tourism industry and economic planners require segmented accommodation data to understand market composition. Classification reveals whether the city has adequate budget options for students and backpackers, sufficient mid-market options for families, and appropriate luxury inventory for high-spending tourists. This information guides tourism infrastructure planning and economic development strategies.
6. Statistical and Modeling Considerations
From an analytical perspective, discrete categories reduce the impact of measurement noise in self-reported nightly rates, handle non-linear relationships between features and price tiers more effectively than linear regression assumptions, and provide clearer, more actionable insights than continuous predictions with confidence intervals.
This classification framework transforms a continuous prediction problem into an actionable decision support tool, providing clear categorical predictions that align with how stakeholders actually use pricing information in real-world decisions.
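The tier mapping described above can be sketched directly in R with `cut()`; the prices below are hypothetical examples, not values from the dataset:

```r
# Hypothetical nightly rates (AUD)
prices <- c(85, 150, 450)

# Map each continuous price to its market segment
tiers <- cut(prices,
             breaks = c(0, 100, 200, Inf),
             labels = c("Budget", "MidMarket", "Premium"),
             include.lowest = TRUE)

as.character(tiers)  # "Budget" "MidMarket" "Premium"
```

This is the same discretisation applied to the real price variable later in the report.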
The Sydney Airbnb Listings dataset contains detailed information on over 18,000 listings across the city, with 79 variables describing property characteristics, host details, geographic location, availability and customer engagement. Key attributes include listing identifiers, host information, neighbourhoods, room type, number of reviews, minimum nights, availability and pricing. For the purpose of this study, the focus is on the price variable, which was cleaned to remove formatting artefacts and extreme outliers, and subsequently transformed into a categorical target variable representing three market segments (Inside Airbnb, 2025; Cox, 2024).
library(ggplot2) # Advanced plotting
library(dplyr) # Data manipulation
library(readr) # Reading CSV files
library(stringr) # String manipulation
library(plotly) # Interactive plots
library(gridExtra) # Arranging multiple plots
library(scales) # Axis and label formatting
library(knitr) # Table formatting
library(DT) # Interactive tables
library(MLmetrics) # Machine learning metrics
library(pROC) # ROC curve analysis
Primary Dataset: Inside Airbnb Sydney Listings
Reference: Inside Airbnb. (2025). Sydney, New South Wales, Australia Dataset. Retrieved from http://insideairbnb.com/get-the-data/. Data sourced from publicly available information from Airbnb.com. Murray Cox, Inside Airbnb Project.
Data Collection Method: Web scraping of publicly available Airbnb listing information
Data Currency: Most recent quarterly snapshot available (2025)
## Rows: 18187 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): listing_url, source, name, description, neighborhood_overview, pi...
## dbl (42): id, scrape_id, host_id, host_listings_count, host_total_listings_...
## lgl (7): host_is_superhost, host_has_profile_pic, host_identity_verified, ...
## date (5): last_scraped, host_since, calendar_last_scraped, first_review, la...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Replace all "N/A" values with blank in character columns
char_cols <- names(listings_raw)[sapply(listings_raw, is.character)]
for(col in char_cols) {
listings_raw[[col]][listings_raw[[col]] == "N/A"] <- ""
}
# Define variables
dims <- paste(dim(listings_raw), collapse = " x ")
nvars <- ncol(listings_raw)
# Print all in one cat
cat(" Full dataset dimensions:", dims, "\n", "Total variables available:", nvars, "\n")
## Full dataset dimensions: 18187 x 79
## Total variables available: 79
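The replacement loop above can also be written without an explicit loop using `dplyr::across()`. A minimal equivalent sketch on a toy tibble (the column names here are illustrative, not taken from the dataset):

```r
library(dplyr)

toy <- tibble(
  host_response_rate = c("90%", "N/A"),
  name               = c("Harbour view studio", "N/A")
)

# Blank out "N/A" placeholders in every character column at once
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ ifelse(. == "N/A", "", .)))

toy_clean$name  # "Harbour view studio" ""
```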
Given the comprehensive nature of the Inside Airbnb dataset (18,187 listings x 79 features), we employ a strategic feature selection approach focusing on the variables most relevant to price classification.
# FEATURE SELECTION: Selecting the most relevant variables for classification
selected_features <- c(
"id", "price", "property_type", "room_type", "accommodates",
"bedrooms", "bathrooms", "amenities", "neighbourhood_cleansed",
"latitude", "longitude", "host_is_superhost", "host_response_rate",
"host_listings_count", "host_identity_verified", "review_scores_rating",
"number_of_reviews", "reviews_per_month", "availability_365",
"minimum_nights"
)
# Convert "t"/"f" character columns to logical
listings_raw <- listings_raw %>%
mutate(across(where(~ all(. %in% c("t", "f"))), ~ . == "t"))
# Selected only the chosen features
listings <- listings_raw %>%
select(all_of(selected_features))
# Variable types
numeric_vars <- listings %>% select_if(is.numeric) %>% names()
character_vars <- listings %>% select_if(is.character) %>% names()
boolean_vars <- listings %>% select_if(is.logical) %>% names()
cat(" DATASET SUMMARY:\n",
"Number of observations:", nrow(listings), ", Number of variables:", ncol(listings), "\n",
"\n VARIABLE TYPES:\n",
"Numeric variables (", length(numeric_vars), "):", paste(numeric_vars, collapse = ", "), "\n",
"Character variables (", length(character_vars), "):", paste(character_vars, collapse = ", "), "\n",
"Boolean variables (", length(boolean_vars), "):", paste(boolean_vars, collapse = ", "), "\n")
## DATASET SUMMARY:
## Number of observations: 18187 , Number of variables: 20
##
## VARIABLE TYPES:
## Numeric variables ( 12 ): id, accommodates, bedrooms, bathrooms, latitude, longitude, host_listings_count, review_scores_rating, number_of_reviews, reviews_per_month, availability_365, minimum_nights
## Character variables ( 6 ): price, property_type, room_type, amenities, neighbourhood_cleansed, host_response_rate
## Boolean variables ( 2 ): host_is_superhost, host_identity_verified
To simplify the classification process, the continuous target variable price was transformed into a categorical outcome representing distinct market segments. Raw price values, originally stored as character strings with currency symbols and commas, were first cleaned and converted to numeric format. Extreme outliers, such as nightly rates above $1,000, were later excluded to reduce noise and improve model stability.
Additionally, to prevent issues with rare categorical levels appearing only in test data, we preprocess high-cardinality categorical variables by grouping rare categories into an “Other” category.
# Creating target variable based on price thresholds
# Cleaning price data
listings$price_numeric <- as.numeric(gsub("[$,]", "", listings$price))
# Creating price categories
listings$price_category <- cut(
listings$price_numeric,
breaks = c(0, 100, 200, Inf),
labels = c("Budget", "MidMarket", "Premium"),
include.lowest = TRUE
)
# Summaries
price_summary <- summary(listings$price_numeric)
target_dist <- table(listings$price_category)
target_props <- prop.table(target_dist) * 100
cat(
"CREATING TARGET VARIABLE:\n\n",
"PRICE SUMMARY (in $ per night):\n",
sprintf("Min : %.2f\n", price_summary["Min."]),
sprintf("1st Qu. : %.2f\n", price_summary["1st Qu."]),
sprintf("Median : %.2f\n", price_summary["Median"]),
sprintf("Mean : %.2f\n", price_summary["Mean"]),
sprintf("3rd Qu. : %.2f\n", price_summary["3rd Qu."]),
sprintf("Max : %.2f\n\n", price_summary["Max."]),
"TARGET VARIABLE DEFINITION:\n",
"- Budget : $0-100/night (Budget-conscious travelers)\n",
"- MidMarket : $100-200/night (Mainstream market)\n",
"- Premium : >$200/night (Luxury segment)\n\n",
"TARGET VARIABLE DISTRIBUTION:\n",
sprintf("Budget : %d (%.2f%%)\n", target_dist["Budget"], target_props["Budget"]),
sprintf("MidMarket : %d (%.2f%%)\n", target_dist["MidMarket"], target_props["MidMarket"]),
sprintf("Premium : %d (%.2f%%)\n", target_dist["Premium"], target_props["Premium"]),
sprintf("NaN values : %d \n", (18187-(target_dist["Budget"]+target_dist["MidMarket"]+target_dist["Premium"]))),
sep = ""
)
## CREATING TARGET VARIABLE:
##
## PRICE SUMMARY (in $ per night):
## Min : 17.00
## 1st Qu. : 139.00
## Median : 206.00
## Mean : 339.47
## 3rd Qu. : 329.00
## Max : 20000.00
##
## TARGET VARIABLE DEFINITION:
## - Budget : $0-100/night (Budget-conscious travelers)
## - MidMarket : $100-200/night (Mainstream market)
## - Premium : >$200/night (Luxury segment)
##
## TARGET VARIABLE DISTRIBUTION:
## Budget : 2181 (13.86%)
## MidMarket : 5433 (34.53%)
## Premium : 8120 (51.61%)
## NaN values : 2453
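One detail worth noting about the `cut()` call above: intervals are right-closed by default, so a listing priced at exactly $100 falls in Budget and one at exactly $200 falls in MidMarket. A quick self-contained check on hypothetical boundary prices:

```r
# Prices sitting exactly on and just past the category boundaries
boundary <- cut(c(100, 101, 200, 201),
                breaks = c(0, 100, 200, Inf),
                labels = c("Budget", "MidMarket", "Premium"),
                include.lowest = TRUE)

as.character(boundary)  # "Budget" "MidMarket" "MidMarket" "Premium"
```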
# Bar Plot for Price vs Number of Properties
ggplot(listings, aes(x = price_category, fill = price_category)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(title = "Distribution of Sydney Airbnb Price Categories",
subtitle = "Classification Target Variable",
x = "Price Category", y = "Number of Properties") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
The raw dataset required extensive cleaning and preprocessing to ensure reliability for analysis and classification modeling. Non-numeric entries in the price field were removed before converting the values to numeric. Character columns containing "N/A" placeholders were standardised, and remaining missing values were imputed with the median or with sensible minimum defaults (0, 1, FALSE). The cleaned dataset provided a complete and consistent foundation with a refined set of features suitable for exploratory analysis and predictive modeling (Michelucci, 2025).
output_str <- ""
# 1. Missing values analysis
missing_summary <- listings %>%
summarise_all(~sum(is.na(.))) %>%
gather(variable, missing_count) %>%
mutate(missing_percent = round(missing_count / nrow(listings) * 100, 2)) %>%
filter(missing_count > 0) %>%
arrange(desc(missing_percent))
if(nrow(missing_summary) > 0) {
output_str <- paste0(output_str, "1. Missing Values Detected:\n")
output_str <- paste0(output_str, "Number of missing columns: ", nrow(missing_summary), "\n")
# Create formatted strings for each variable
missing_strings <- paste0(missing_summary$variable, ": ",
missing_summary$missing_count, " (",
missing_summary$missing_percent, "%)")
# Join all with | separator
output_str <- paste0(output_str, paste(missing_strings, collapse = " | "), "\n")
} else {
output_str <- paste0(output_str, "1. Missing Values: No missing values detected in selected features\n")
}
# 2. Price outliers
price_outliers <- listings %>%
filter(price_numeric > quantile(price_numeric, 0.99, na.rm = TRUE) |
price_numeric < quantile(price_numeric, 0.01, na.rm = TRUE)) %>%
nrow()
output_str <- paste0(output_str, "\n2. Price Outliers:\n")
output_str <- paste0(output_str, "Potential price outliers (beyond 1st/99th percentile): ", price_outliers, "\n")
# 3. Categorical variable complexity
output_str <- paste0(output_str, "\n3. High-Dimensional Categorical Data:\n")
output_str <- paste0(output_str, "Number of unique neighbourhoods: ", length(unique(listings$neighbourhood_cleansed)), "\n")
output_str <- paste0(output_str, "Number of unique property types: ", length(unique(listings$property_type)), "\n")
# 4. Class imbalance
min_class_prop <- min(prop.table(table(listings$price_category)))
max_class_prop <- max(prop.table(table(listings$price_category)))
imbalance_ratio <- max_class_prop / min_class_prop
output_str <- paste0(output_str, "\n4. Class Imbalance Analysis:\n")
output_str <- paste0(output_str, "Class imbalance ratio: ", round(imbalance_ratio, 2), ":1\n")
# 5. Additional challenges
output_str <- paste0(output_str, "\n5. Additional Challenges:\n")
output_str <- paste0(output_str, "- Geographic clustering effects in Sydney neighborhoods\n")
output_str <- paste0(output_str, "- Seasonal pricing variations not captured in snapshot data\n")
output_str <- paste0(output_str, "- Text processing requirements for amenities field\n")
output_str <- paste0(output_str, "- Potential correlation between location and property characteristics\n")
cat(output_str)
## 1. Missing Values Detected:
## Number of missing columns: 11
## review_scores_rating: 3179 (17.48%) | reviews_per_month: 3179 (17.48%) | bathrooms: 2458 (13.52%) | price: 2453 (13.49%) | price_numeric: 2453 (13.49%) | price_category: 2453 (13.49%) | host_is_superhost: 556 (3.06%) | bedrooms: 436 (2.4%) | host_response_rate: 5 (0.03%) | host_listings_count: 5 (0.03%) | host_identity_verified: 5 (0.03%)
##
## 2. Price Outliers:
## Potential price outliers (beyond 1st/99th percentile): 301
##
## 3. High-Dimensional Categorical Data:
## Number of unique neighbourhoods: 38
## Number of unique property types: 69
##
## 4. Class Imbalance Analysis:
## Class imbalance ratio: 3.72:1
##
## 5. Additional Challenges:
## - Geographic clustering effects in Sydney neighborhoods
## - Seasonal pricing variations not captured in snapshot data
## - Text processing requirements for amenities field
## - Potential correlation between location and property characteristics
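The 3.72:1 imbalance reported above can be partially addressed at modeling time with inverse-frequency class weights. A hedged sketch of one common weighting scheme, using the class counts printed earlier (the normalisation so the average weight is 1 is a conventional choice, not something prescribed by the dataset):

```r
# Class counts from the target distribution above
counts <- c(Budget = 2181, MidMarket = 5433, Premium = 8120)

# Inverse-frequency weights, normalised so weights average to 1;
# rare classes (Budget) receive weight > 1, common classes < 1
weights <- sum(counts) / (length(counts) * counts)
round(weights, 2)
```

These weights could then be passed to classifiers that accept observation or class weights.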
output_text <- ""
# Initial data dimensions
initial_dim <- dim(listings)
output_text <- paste0(output_text, "Initial data dimensions: ", initial_dim[1], " rows x ", initial_dim[2], " columns\n\n")
# 1. Handling price data
listings$price_numeric <- as.numeric(gsub("[$,]", "", listings$price))
outlier_threshold <- 1000
initial_count <- nrow(listings)
listings <- listings %>% filter(price_numeric > 0 & price_numeric <= outlier_threshold)
removed_outliers <- initial_count - nrow(listings)
output_text <- paste0(output_text, "Removed ", removed_outliers, " extreme price outliers (>$", outlier_threshold, ")\n")
output_text <- paste0(output_text, "Remaining observations: ", nrow(listings), "\n\n")
# 2. Clean host_is_superhost (read_csv already parsed this column as logical;
# comparing a logical to "t" would silently turn every value FALSE, so only
# convert if it is still character)
if (is.character(listings$host_is_superhost)) {
listings$host_is_superhost <- listings$host_is_superhost == "t"
}
# 3. Handle host_response_rate
if("host_response_rate" %in% names(listings)) {
listings$host_response_rate <- as.numeric(gsub("%", "", listings$host_response_rate)) / 100
}
# 4. Process amenities
if("amenities" %in% names(listings)) {
listings$amenities_count <- ifelse(
is.na(listings$amenities) | listings$amenities == "" | listings$amenities == "[]",
0,
str_count(listings$amenities, '",') + 1
)
} else {
listings$amenities_count <- 0
}
# 5. Handle missing values
missing_summary <- listings %>%
summarise_all(~sum(is.na(.))) %>%
gather(variable, missing_count) %>%
mutate(missing_percent = round(missing_count / nrow(listings) * 100, 2)) %>%
filter(missing_count > 0) %>%
arrange(desc(missing_percent))
if(nrow(missing_summary) > 0) {
for(i in 1:nrow(missing_summary)) {
output_text <- paste0(output_text, sprintf("- %s: %d missing (%.2f%%)\n",
missing_summary$variable[i],
missing_summary$missing_count[i],
missing_summary$missing_percent[i]))
}
output_text <- paste0(output_text, "\n")
# Imputation
if("reviews_per_month" %in% missing_summary$variable) {
listings$reviews_per_month[is.na(listings$reviews_per_month)] <- 0
}
if("host_is_superhost" %in% missing_summary$variable) {
listings$host_is_superhost[is.na(listings$host_is_superhost)] <- FALSE
}
if("bathrooms" %in% missing_summary$variable) {
median_bathrooms <- median(listings$bathrooms, na.rm = TRUE)
listings$bathrooms[is.na(listings$bathrooms)] <- median_bathrooms
}
if("host_listings_count" %in% missing_summary$variable) {
listings$host_listings_count[is.na(listings$host_listings_count)] <- 1
}
if("host_identity_verified" %in% missing_summary$variable) {
listings$host_identity_verified[is.na(listings$host_identity_verified)] <- FALSE
}
if("bedrooms" %in% missing_summary$variable) {
listings$bedrooms[is.na(listings$bedrooms)] <- ceiling(listings$accommodates[is.na(listings$bedrooms)] / 2)
}
if("review_scores_rating" %in% missing_summary$variable) {
median_rating <- median(listings$review_scores_rating, na.rm = TRUE)
listings$review_scores_rating[is.na(listings$review_scores_rating)] <- median_rating
}
if("host_response_rate" %in% missing_summary$variable) {
median_response_rate <- median(listings$host_response_rate, na.rm = TRUE)
listings$host_response_rate[is.na(listings$host_response_rate)] <- median_response_rate
}
} else {
output_text <- paste0(output_text, "5. Handling missing values... No missing values detected after initial cleaning\n")
}
# Verify missing values
missing_after <- listings %>%
summarise_all(~sum(is.na(.))) %>%
gather(variable, missing_count) %>%
filter(missing_count > 0)
if(nrow(missing_after) > 0) {
output_text <- paste0(output_text, "\nVERIFYING IMPUTATION RESULTS:\n Still have missing values in:\n")
for(i in 1:nrow(missing_after)) {
output_text <- paste0(output_text, sprintf("- %s: %d missing\n",
missing_after$variable[i],
missing_after$missing_count[i]))
}
} else {
output_text <- paste0(output_text, "\nVERIFYING IMPUTATION RESULTS:\n All missing values successfully handled!\n")
}
output_text <- paste0(output_text, " Missing data imputation strategy completed.\n\n")
# 6. Feature engineering
listings <- listings %>%
mutate(
is_popular_area = neighbourhood_cleansed %in% c("Bondi", "Sydney", "Manly", "Darlinghurst", "Surry Hills"),
distance_from_cbd = sqrt((latitude - (-33.8688))^2 + (longitude - 151.2093)^2),
property_size = case_when(
accommodates <= 2 ~ "Small",
accommodates <= 4 ~ "Medium",
accommodates <= 8 ~ "Large",
TRUE ~ "Extra Large"
),
host_experience = case_when(
host_listings_count == 1 ~ "Single Property",
host_listings_count <= 5 ~ "Small Portfolio",
TRUE ~ "Large Portfolio"
),
availability_level = case_when(
availability_365 < 90 ~ "Low",
availability_365 < 180 ~ "Medium",
TRUE ~ "High"
)
)
# 7. Remove duplicates
initial_rows <- nrow(listings)
listings <- listings %>% distinct()
duplicates_removed <- initial_rows - nrow(listings)
# Recreate target variable
listings$price_category <- cut(listings$price_numeric,
breaks = c(0, 100, 200, Inf),
labels = c("Budget", "MidMarket", "Premium"),
include.lowest = TRUE)
# Final dataset summary
final_dim <- dim(listings)
final_target_dist <- table(listings$price_category)
final_target_prop <- round(prop.table(final_target_dist), 3)
output_text <- paste0(output_text, "Final Cleaned Dataset:\nDimensions: ", final_dim[1], " rows x ", final_dim[2], " columns\n")
output_text <- paste0(output_text, "Complete cases: ", sum(complete.cases(listings)), "\n\n")
output_text <- paste0(output_text, "Final target distribution:\n")
for(level in names(final_target_dist)) {
output_text <- paste0(output_text, sprintf("%-10s : %d (%.3f)\n", level, final_target_dist[level], final_target_prop[level]))
}
cat(output_text)
## Initial data dimensions: 18187 rows x 22 columns
##
## Removed 3180 extreme price outliers (>$1000)
## Remaining observations: 15007
##
## - host_response_rate: 2541 missing (16.93%)
## - review_scores_rating: 2375 missing (15.83%)
## - reviews_per_month: 2375 missing (15.83%)
## - host_is_superhost: 486 missing (3.24%)
## - bedrooms: 18 missing (0.12%)
## - bathrooms: 5 missing (0.03%)
## - host_listings_count: 2 missing (0.01%)
## - host_identity_verified: 2 missing (0.01%)
##
##
## VERIFYING IMPUTATION RESULTS:
## All missing values successfully handled!
## Missing data imputation strategy completed.
##
## Final Cleaned Dataset:
## Dimensions: 15007 rows x 28 columns
## Complete cases: 15007
##
## Final target distribution:
## Budget : 2181 (0.145)
## MidMarket : 5433 (0.362)
## Premium : 7393 (0.493)
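The `distance_from_cbd` feature engineered above is a Euclidean distance in raw degrees, which works as a relative signal but is not in physical units. If kilometres were wanted instead, a haversine sketch could be used (same CBD reference coordinates as above; the listing location here is a hypothetical point near Bondi):

```r
# Great-circle distance in km between two lat/lon points (haversine formula)
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Hypothetical listing near Bondi vs the Sydney CBD reference point
haversine_km(-33.8908, 151.2743, -33.8688, 151.2093)
```

For tree-based classifiers the degree-based distance is usually sufficient, since only the ordering of distances matters.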
To prevent modeling errors from rare categories appearing only in train or test sets, we group infrequent levels into an “Other” category.
# Function to collapse rare categories into "Other"
collapse_rare_levels <- function(data, column, min_freq = 30) {
freq_table <- table(data[[column]])
rare_levels <- names(freq_table[freq_table < min_freq])
if (length(rare_levels) > 0) {
data[[column]] <- as.character(data[[column]])
data[[column]][data[[column]] %in% rare_levels] <- "Other"
data[[column]] <- as.factor(data[[column]])
}
return(data)
}
# Apply to high-cardinality categorical variables
listings <- collapse_rare_levels(listings, "property_type", min_freq = 30)
listings <- collapse_rare_levels(listings, "neighbourhood_cleansed", min_freq = 20)
cat("After collapsing rare categories:\n")
## After collapsing rare categories:
## Unique property types: 25
## Unique neighbourhoods: 38
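A quick self-contained check of the collapsing behaviour on toy data (the function is restated from above so the example runs on its own; the property types and threshold here are illustrative):

```r
# Collapse factor levels with fewer than min_freq observations into "Other"
collapse_rare_levels <- function(data, column, min_freq = 30) {
  freq_table <- table(data[[column]])
  rare_levels <- names(freq_table[freq_table < min_freq])
  if (length(rare_levels) > 0) {
    data[[column]] <- as.character(data[[column]])
    data[[column]][data[[column]] %in% rare_levels] <- "Other"
    data[[column]] <- as.factor(data[[column]])
  }
  data
}

# Toy data: one common level (40 listings), one rare level (2 listings)
toy <- data.frame(property_type = c(rep("Entire home", 40), rep("Castle", 2)))
toy <- collapse_rare_levels(toy, "property_type", min_freq = 30)

levels(toy$property_type)  # "Entire home" "Other"
```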
Exploratory data analysis was conducted to uncover key patterns and relationships within the Sydney Airbnb market (Inside Airbnb, 2025; Cox, 2024). The distribution of nightly prices reinforced the decision to classify listings into Budget, MidMarket, and Premium segments. Room type emerged as a major determinant of price, with entire homes and apartments commanding higher rates than private or shared rooms (Australian Bureau of Statistics, 2023; NSW Government, 2024). Additional analyses showed that listings with more reviews and greater availability tended to cluster in the Budget and MidMarket categories, whereas Premium properties were less frequent but typically associated with high-demand tourist areas such as the Sydney CBD.
# Target distribution
p1 <- ggplot(listings, aes(x = price_category, fill = price_category)) +
geom_bar() +
geom_text(stat = "count",
aes(label = after_stat(paste0(count, "\n(", round(count / sum(count) * 100, 1), "%)"))),
vjust = -0.5) +
labs(title = "Distribution of Price Categories",
subtitle = "Classification target variable",
x = "Price Category", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
# Accommodates by category
p2 <- ggplot(listings, aes(x = accommodates, fill = price_category)) +
geom_histogram(bins = 15, position = "dodge", alpha = 0.7) +
labs(title = "Guest Capacity Distribution by Price Category",
x = "Number of Guests Accommodated", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
facet_wrap(~price_category, ncol = 1, scales = "free_y")
# Bedrooms by category
p3 <- ggplot(listings, aes(x = bedrooms, fill = price_category)) +
geom_histogram(bins = 10, position = "dodge", alpha = 0.7) +
labs(title = "Bedroom Distribution by Price Category",
x = "Number of Bedrooms", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
facet_wrap(~price_category, ncol = 1, scales = "free_y")
# Distance from CBD by category boxplot
p4 <- ggplot(listings, aes(x = price_category, y = distance_from_cbd, fill = price_category)) +
geom_boxplot() +
labs(title = "Distance from CBD by Price Category",
subtitle = "Premium properties tend to be closer to city center",
x = "Price Category", y = "Distance from CBD") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
# Histogram of actual price distribution within each category
p5 <- ggplot(listings, aes(x = price_numeric, fill = price_category)) +
geom_histogram(bins = 30, alpha = 0.7) +
labs(title = "Price Distribution Within Each Category",
subtitle = "Examining the spread of actual prices within Budget, MidMarket, and Premium tiers",
x = "Nightly Price (AUD)", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
facet_wrap(~price_category, ncol = 1, scales = "free") +
scale_x_continuous(labels = dollar_format(prefix = "$"))
# Arrange plots
grid.arrange(p1, p4, ncol=1)
# Property type analysis
p3 <- listings %>%
count(property_type, price_category) %>%
group_by(property_type) %>%
filter(sum(n) >= 50) %>% # keep property types with >= 50 listings
ungroup() %>%
ggplot(aes(x = reorder(property_type, n), y = n, fill = price_category)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(title = "Property Types by Price Category",
subtitle = "Only property types with 50+ listings shown",
x = "Property Type", y = "Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71",
"MidMarket" = "#f39c12",
"Premium" = "#e74c3c"))
# Room type analysis
p4 <- listings %>%
ggplot(aes(x = room_type, fill = price_category)) +
geom_bar(position = "fill") +
labs(title = "Room Type Composition by Price Category",
subtitle = "Proportion of each price category within room types",
x = "Room Type", y = "Proportion") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71",
"MidMarket" = "#f39c12",
"Premium" = "#e74c3c")) +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Arrange both plots
gridExtra::grid.arrange(p3, p4, ncol = 2)
# Statistical summary of numeric features by price category
numeric_summary <- listings %>%
dplyr::select(price_category, accommodates, bedrooms, bathrooms, host_listings_count,
number_of_reviews, review_scores_rating, availability_365,
distance_from_cbd, amenities_count, minimum_nights) %>%
group_by(price_category) %>%
summarise(
mean_accommodates = mean(accommodates, na.rm = TRUE),
mean_bedrooms = mean(bedrooms, na.rm = TRUE),
mean_bathrooms = mean(bathrooms, na.rm = TRUE),
mean_amenities = mean(amenities_count, na.rm = TRUE),
mean_reviews = mean(number_of_reviews, na.rm = TRUE),
mean_rating = mean(review_scores_rating, na.rm = TRUE),
mean_distance_cbd = mean(distance_from_cbd, na.rm = TRUE),
mean_availability = mean(availability_365, na.rm = TRUE)
)
print(kable(numeric_summary, digits = 2,
caption = "Mean Feature Values by Price Category"))
##
##
## Table: Mean Feature Values by Price Category
##
## |price_category | mean_accommodates| mean_bedrooms| mean_bathrooms| mean_amenities| mean_reviews| mean_rating| mean_distance_cbd| mean_availability|
## |:--------------|-----------------:|-------------:|--------------:|--------------:|------------:|-----------:|-----------------:|-----------------:|
## |Budget | 1.85| 1.06| 1.22| 29.36| 34.01| 4.61| 0.15| 220.45|
## |MidMarket | 3.04| 1.23| 1.16| 35.53| 55.43| 4.73| 0.10| 178.02|
## |Premium | 4.92| 2.26| 1.62| 40.40| 32.65| 4.77| 0.10| 194.45|
# Boxplots comparing key numeric features across categories
p1 <- ggplot(listings, aes(x = price_category, y = accommodates, fill = price_category)) +
geom_boxplot() +
labs(title = "Guest Capacity by Category", x = "", y = "Accommodates") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
p2 <- ggplot(listings, aes(x = price_category, y = bedrooms, fill = price_category)) +
geom_boxplot() +
labs(title = "Bedrooms by Category", x = "", y = "Bedrooms") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
p3 <- ggplot(listings, aes(x = price_category, y = amenities_count, fill = price_category)) +
geom_boxplot() +
labs(title = "Amenities by Category", x = "", y = "Amenity Count") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
p4 <- ggplot(listings, aes(x = price_category, y = review_scores_rating, fill = price_category)) +
geom_boxplot() +
labs(title = "Review Scores by Category", x = "", y = "Rating") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
p5 <- ggplot(listings, aes(x = price_category, y = availability_365, fill = price_category)) +
geom_boxplot() +
labs(title = "Availability by Category", x = "Price Category", y = "Days Available") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
p6 <- ggplot(listings, aes(x = price_category, y = number_of_reviews, fill = price_category)) +
geom_boxplot() +
labs(title = "Review Count by Category", x = "Price Category", y = "Number of Reviews") +
theme_minimal() +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
theme(legend.position = "none")
# Arrange plots
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)
# Categorical feature distributions by price category
cat("\n\nHost Superhost Status by Price Category:\n")
##
##
## Host Superhost Status by Price Category:
##
## FALSE
## Budget 1
## MidMarket 1
## Premium 1
##
##
## Room Type Distribution by Price Category:
##
## Entire home/apt Hotel room Private room Shared room
## Budget 0.1357175608 0.0032095369 0.8486932600 0.0123796424
## MidMarket 0.8210933186 0.0064421130 0.1717283269 0.0007362415
## Premium 0.9500879210 0.0048694711 0.0449073448 0.0001352631
After completing the exploratory data analysis, we export the cleaned and processed dataset for potential future use.
# Export the cleaned dataset with all engineered features
output_file <- "listings_cleaned_with_features.csv"
write_csv(listings, output_file)
cat("Cleaned dataset exported successfully!\n")
## Cleaned dataset exported successfully!
## File: listings_cleaned_with_features.csv
## Location: /Users/ABRAHAM/Documents/USYD/Sem 2/Computational Statistical Methods- STAT5003/Assignment2
## Dimensions: 15007 rows x 28 columns
##
## This dataset includes:
## - Original features after cleaning and imputation
## - Target variable: price_category (Budget, MidMarket, Premium)
## - Engineered features: amenities_count, distance_from_cbd, is_popular_area,
## property_size, host_experience, availability_level
The modelling phase focuses on predicting Airbnb price categories using a classification approach (Inside Airbnb, 2025; Cox, 2024). To ensure robust results, five machine learning algorithms covered in the course were selected. The dataset is split into training and test sets, with cross-validation applied during training to minimize overfitting and improve generalization (Dhummad, 2025; Katyal, Sharma, & Kannan, 2025). Model performance is assessed using multiple evaluation metrics: accuracy for overall correctness, precision and recall to capture class-level performance, and macro or weighted F1-scores to account for potential class imbalance across the three price tiers. This modeling plan balances interpretability with predictive accuracy, providing both actionable insights and reliable classification outcomes.
We implement five classification algorithms, prioritizing methods taught in STAT5003, to predict Sydney Airbnb price categories.
| Model | Purpose | Strengths | Use Case | Rationale for Dataset |
|---|---|---|---|---|
| Multinomial Logistic Regression | Baseline interpretable model | Interpretable, fast, probability outputs | Linear relationships | Provides transparent baseline for feature contributions |
| Random Forest | Ensemble method | Handles mixed data, resistant to overfitting, feature importance | Captures non-linear relationships | Handles categorical & numerical features, identifies key drivers |
| Support Vector Machine | High-dimensional classification | Robust to outliers, flexible boundaries | Complex decision boundaries | Separates overlapping price categories using kernels |
| Linear Discriminant Analysis | Dimensionality reduction | Simple, interpretable, efficient | Maximize class separation | Reduces redundancy in correlated features |
| K Nearest Neighbors | Non-parametric, instance-based | Simple, local pattern recognition | Geographic/neighborhood patterns | Leverages localized pricing similarity |
The Sydney Airbnb dataset is split into 70% training data and 30% test data. A stratified split ensures the classification models learn patterns from a representative sample of each price category and that their performance can be assessed fairly on unseen data.
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
##
## MAE, RMSE
## The following object is masked from 'package:purrr':
##
## lift
# Stratified train-test split ratio (70:30)
set.seed(123)
train_indices <- createDataPartition(listings$price_category, p = 0.7, list = FALSE)
train_data <- listings[train_indices, ]
test_data <- listings[-train_indices, ]
# Compute sizes and percentages
train_size <- nrow(train_data)
test_size <- nrow(test_data)
train_pct <- round(train_size / nrow(listings) * 100, 1)
test_pct <- round(test_size / nrow(listings) * 100, 1)
# Class distributions
train_dist <- round(prop.table(table(train_data$price_category)), 3)
test_dist <- round(prop.table(table(test_data$price_category)), 3)
output_text <- paste0(
"Data Splitting Summary:\n",
"Training set size: ", train_size, " (", train_pct, "%)\n",
"Test set size : ", test_size, " (", test_pct, "%)\n\n",
"Class distribution in training set:\n",
paste(names(train_dist), ":", train_dist, collapse = "\n"), "\n\n",
"Class distribution in test set:\n",
paste(names(test_dist), ":", test_dist, collapse = "\n"), "\n"
)
cat(output_text)
## Data Splitting Summary:
## Training set size: 10507 (70%)
## Test set size : 4500 (30%)
##
## Class distribution in training set:
## Budget : 0.145
## MidMarket : 0.362
## Premium : 0.493
##
## Class distribution in test set:
## Budget : 0.145
## MidMarket : 0.362
## Premium : 0.493
Our evaluation metrics fall under three categories:
1. Overall Performance Metrics: accuracy and Cohen's kappa
2. Class-Specific Metrics: sensitivity, specificity, precision, and F1 score for each price tier
3. Discrimination Metrics: one-vs-rest ROC curves and AUC values
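The macro-averaged F1 score used later to compare models can be computed directly from a confusion matrix. A minimal sketch with hypothetical counts, where rows are predictions and columns are the actual classes (matching caret's confusionMatrix layout):

```r
# Macro F1: unweighted mean of per-class F1 scores, so minority classes
# (e.g. Budget) count as much as the majority class (Premium).
macro_f1 <- function(cm) {
  f1 <- sapply(seq_len(nrow(cm)), function(i) {
    tp <- cm[i, i]
    precision <- tp / sum(cm[i, ])  # of everything predicted as class i
    recall    <- tp / sum(cm[, i])  # of everything actually class i
    2 * precision * recall / (precision + recall)
  })
  mean(f1)
}

# Hypothetical 3x3 confusion matrix (Budget, MidMarket, Premium)
cm <- matrix(c(50,  5,  2,
                4, 60, 10,
                1,  8, 70), nrow = 3, byrow = TRUE)
m <- macro_f1(cm)  # about 0.86 for this toy matrix
```

Because each class contributes equally to the mean, macro F1 penalises a model that does well on Premium but poorly on Budget, which plain accuracy would mask.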
# Prepare features for modeling
prepare_features <- function(data) {
model_data <- data %>%
dplyr::select(
# Numeric features
accommodates, bedrooms, bathrooms, host_listings_count,
number_of_reviews, review_scores_rating, availability_365,
minimum_nights, distance_from_cbd, amenities_count,
# Categorical features
property_type, room_type, neighbourhood_cleansed,
host_is_superhost, host_identity_verified,
is_popular_area, property_size, host_experience,
availability_level,
# Target variable
price_category
) %>%
na.omit() # Remove any remaining missing values
return(model_data)
}
# Preparing train and test datasets
train_features <- prepare_features(train_data)
test_features <- prepare_features(test_data)
# Get feature names excluding target variable
feature_names <- names(train_features)[names(train_features) != "price_category"]
# Feature type summary
feature_types <- train_features %>%
dplyr::select(-price_category) %>%
summarise_all(~ifelse(is.numeric(.), "Numeric", "Categorical")) %>%
gather(Feature, Type) %>%
count(Type)
# Format feature summary as text
feature_summary_text <- paste0(feature_types$Type, ": ", feature_types$n, collapse = ", ")
# Format feature names as single line with pipes
feature_names_text <- paste(feature_names, collapse = " | ")
cat(
"Feature Preparation Summary:\n",
"Training features shape: ", dim(train_features)[1], " rows x ", dim(train_features)[2], " columns\n",
"Test features shape : ", dim(test_features)[1], " rows x ", dim(test_features)[2], " columns\n",
"Number of features for modeling (excluding target): ", ncol(train_features) - 1, "\n\n",
"Feature type summary: ", feature_summary_text, "\n\n",
"The 19 features for modeling:\n",
feature_names_text, "\n"
)
## Feature Preparation Summary:
## Training features shape: 10507 rows x 20 columns
## Test features shape : 4500 rows x 20 columns
## Number of features for modeling (excluding target): 19
##
## Feature type summary: Categorical: 9, Numeric: 10
##
## The 19 features for modeling:
## accommodates | bedrooms | bathrooms | host_listings_count | number_of_reviews | review_scores_rating | availability_365 | minimum_nights | distance_from_cbd | amenities_count | property_type | room_type | neighbourhood_cleansed | host_is_superhost | host_identity_verified | is_popular_area | property_size | host_experience | availability_level
The initial dataset consisted of 20 features, including 12 numeric (such as id, accommodates, bedrooms, bathrooms, latitude, longitude, host_listings_count, review_scores_rating, number_of_reviews, reviews_per_month, availability_365, and minimum_nights), 6 character (price, property_type, room_type, amenities, neighbourhood_cleansed, and host_response_rate), and 2 boolean variables (host_is_superhost and host_identity_verified).
From these, 8 additional features were engineered: price_numeric and price_category from price, amenities_count from amenities, is_popular_area from neighbourhood_cleansed, distance_from_cbd from latitude and longitude, host_experience from host_listings_count, property_size from accommodates, and availability_level from availability_365.
For machine learning modeling, we finalized 19 predictive features—accommodates, bedrooms, bathrooms, host_listings_count, number_of_reviews, review_scores_rating, availability_365, minimum_nights, distance_from_cbd, amenities_count, property_type, room_type, neighbourhood_cleansed, host_is_superhost, host_identity_verified, is_popular_area, property_size, host_experience, and availability_level—with the target variable defined as price_category (Budget, MidMarket, Premium).
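The target variable can be derived from the numeric nightly price with base R's cut(). A minimal sketch; the boundary handling (whether exactly $100 or $200 falls into the lower or upper tier) is an assumption, since the tier definitions leave the endpoints ambiguous:

```r
# Hypothetical derivation of price_category from a cleaned numeric price.
# Assumption: right = FALSE puts $100 in MidMarket and $200 in Premium.
price_numeric <- c(85, 150, 320, 99.5, 200)
price_category <- cut(
  price_numeric,
  breaks = c(-Inf, 100, 200, Inf),
  labels = c("Budget", "MidMarket", "Premium"),
  right  = FALSE
)
```

The same one-liner applied to the full price column produces the three-level factor used as the classification target.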
Each model will undergo systematic hyperparameter optimization to select the best performing parameters:
ntree: Number of trees (500, 1000, 1500)mtry: Variables per split (sqrt(p), p/3, p/2)nodesize: Minimum node size (1, 5, 10)prior: Prior probabilities (equal, proportional to
class frequencies, custom)method: Estimation method (moment, mle, mve, t)nu: Degrees of freedom for method=“t” (5, 10, 20)tol: Tolerance for rank deficiency (1e-4, 1e-6,
1e-8)cost: Regularization parameter (0.1, 1, 10, 100)kernel: Kernel type (linear, radial, polynomial)gamma: Kernel coefficient (0.001, 0.01, 0.1, 1)k: Number of neighbors (3, 5, 7, 9, 11, 15)In this section, we implement five classification algorithms and evaluate their performance in predicting Sydney Airbnb price categories. Each model is trained using 3-fold cross-validation and evaluated on the held-out test set.
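The SVM search, for example, can be made explicit rather than left to caret's automatic grid. A sketch of the candidate grid using the cost and gamma values listed above, assuming caret's svmRadial parameterization (which names the kernel coefficient sigma rather than gamma); passing it as tuneGrid to train() would replace tuneLength:

```r
# Hypothetical explicit grid for the RBF-kernel SVM. caret's "svmRadial"
# expects columns named sigma (kernel coefficient) and C (cost).
svm_grid <- expand.grid(
  sigma = c(0.001, 0.01, 0.1, 1),
  C     = c(0.1, 1, 10, 100)
)
nrow(svm_grid)  # 16 candidate combinations
```

Each of the 16 rows is evaluated under cross-validation, and the combination with the best resampled performance is refit on the full training set.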
# Verify price_category levels are valid R names
cat("Price category levels:", levels(train_features$price_category), "\n")
## Price category levels: Budget MidMarket Premium
## Training set dimensions: 10507 x 20
## Test set dimensions: 4500 x 20
Multinomial logistic regression serves as our baseline interpretable model, extending binary logistic regression to handle three price categories simultaneously.
library(nnet)
library(caret)
# Set up cross-validation with repeated k-fold
train_control <- trainControl(
method = "repeatedcv",
number = 5, # 5-fold cross-validation
repeats = 3, # 3 repetitions for robust estimates
classProbs = TRUE,
summaryFunction = multiClassSummary,
savePredictions = "final",
verboseIter = FALSE
)
# Train multinomial logistic regression
set.seed(123)
model_logit <- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count +
property_type + room_type + neighbourhood_cleansed +
host_is_superhost + host_identity_verified + is_popular_area +
property_size + host_experience + availability_level,
data = train_features,
method = "multinom",
trControl = train_control,
trace = FALSE,
MaxNWts = 5000
)
# Predictions
logit_pred <- predict(model_logit, test_features)
logit_pred_prob <- predict(model_logit, test_features, type = "prob")
# Confusion Matrix
logit_cm <- confusionMatrix(logit_pred, test_features$price_category)
print(logit_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Budget MidMarket Premium
## Budget 502 134 23
## MidMarket 131 1045 403
## Premium 21 450 1791
##
## Overall Statistics
##
## Accuracy : 0.7418
## 95% CI : (0.7287, 0.7545)
## No Information Rate : 0.4927
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5725
##
## Mcnemar's Test P-Value : 0.4378
##
## Statistics by Class:
##
## Class: Budget Class: MidMarket Class: Premium
## Sensitivity 0.7676 0.6415 0.8078
## Specificity 0.9592 0.8140 0.7937
## Pos Pred Value 0.7618 0.6618 0.7918
## Neg Pred Value 0.9604 0.8001 0.8097
## Prevalence 0.1453 0.3620 0.4927
## Detection Rate 0.1116 0.2322 0.3980
## Detection Prevalence 0.1464 0.3509 0.5027
## Balanced Accuracy 0.8634 0.7277 0.8008
# Store results
logit_accuracy <- logit_cm$overall['Accuracy']
cat("\nLogistic Regression Test Accuracy:", round(logit_accuracy, 4), "\n")
##
## Logistic Regression Test Accuracy: 0.7418
Random Forest handles non-linear relationships and feature interactions through ensemble learning with decision trees.
library(randomForest)
# Train Random Forest with comprehensive hyperparameter tuning
set.seed(123)
model_rf <- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count +
property_type + room_type + neighbourhood_cleansed +
host_is_superhost + host_identity_verified + is_popular_area +
property_size + host_experience + availability_level,
data = train_features,
method = "rf",
trControl = train_control,
ntree = 500, # Increased to 500 trees for more stable predictions
importance = TRUE,
tuneGrid = data.frame(mtry = c(4, 6, 9)) # sqrt(p) ≈ 4, p/3 ≈ 6, p/2 ≈ 9
)
# Predictions
rf_pred <- predict(model_rf, test_features)
rf_pred_prob <- predict(model_rf, test_features, type = "prob")
# Confusion Matrix
rf_cm <- confusionMatrix(rf_pred, test_features$price_category)
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Budget MidMarket Premium
## Budget 515 116 22
## MidMarket 126 1135 372
## Premium 13 378 1823
##
## Overall Statistics
##
## Accuracy : 0.7718
## 95% CI : (0.7592, 0.784)
## No Information Rate : 0.4927
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6229
##
## Mcnemar's Test P-Value : 0.4275
##
## Statistics by Class:
##
## Class: Budget Class: MidMarket Class: Premium
## Sensitivity 0.7875 0.6967 0.8223
## Specificity 0.9641 0.8265 0.8287
## Pos Pred Value 0.7887 0.6950 0.8234
## Neg Pred Value 0.9639 0.8277 0.8276
## Prevalence 0.1453 0.3620 0.4927
## Detection Rate 0.1144 0.2522 0.4051
## Detection Prevalence 0.1451 0.3629 0.4920
## Balanced Accuracy 0.8758 0.7616 0.8255
# Feature Importance
rf_importance <- varImp(model_rf)
print(plot(rf_importance, top = 15, main = "Top 15 Important Features - Random Forest"))
# Store results
rf_accuracy <- rf_cm$overall['Accuracy']
cat("\nRandom Forest Test Accuracy:", round(rf_accuracy, 4), "\n")
##
## Random Forest Test Accuracy: 0.7718
SVM with radial basis function kernel creates complex decision boundaries in high-dimensional space.
library(e1071)
# Train SVM with RBF kernel and expanded hyperparameter grid
set.seed(123)
model_svm <- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count +
property_type + room_type + neighbourhood_cleansed +
host_is_superhost + host_identity_verified + is_popular_area +
property_size + host_experience + availability_level,
data = train_features,
method = "svmRadial",
trControl = train_control,
preProcess = c("center", "scale"),
tuneLength = 5 # Test 5 different cost/sigma combinations
)
## line search fails -2.840481 0.04220264 1.036726e-05 6.663514e-06 -5.242233e-08 -1.732632e-08 -6.589302e-13
# Predictions
svm_pred <- predict(model_svm, test_features)
svm_pred_prob <- predict(model_svm, test_features, type = "prob")
# Confusion Matrix
svm_cm <- confusionMatrix(svm_pred, test_features$price_category)
print(svm_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Budget MidMarket Premium
## Budget 488 142 25
## MidMarket 147 1056 381
## Premium 19 431 1811
##
## Overall Statistics
##
## Accuracy : 0.7456
## 95% CI : (0.7326, 0.7582)
## No Information Rate : 0.4927
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5787
##
## Mcnemar's Test P-Value : 0.2633
##
## Statistics by Class:
##
## Class: Budget Class: MidMarket Class: Premium
## Sensitivity 0.7462 0.6483 0.8169
## Specificity 0.9566 0.8161 0.8029
## Pos Pred Value 0.7450 0.6667 0.8010
## Neg Pred Value 0.9568 0.8035 0.8187
## Prevalence 0.1453 0.3620 0.4927
## Detection Rate 0.1084 0.2347 0.4024
## Detection Prevalence 0.1456 0.3520 0.5024
## Balanced Accuracy 0.8514 0.7322 0.8099
# Store results
svm_accuracy <- svm_cm$overall['Accuracy']
cat("\nSVM Test Accuracy:", round(svm_accuracy, 4), "\n")
##
## SVM Test Accuracy: 0.7456
LDA finds linear combinations of features that best separate the three price categories. We use only numeric features to avoid collinearity issues with categorical variables.
library(MASS)
# Train LDA with numeric features only (avoiding categorical variables that cause collinearity)
set.seed(123)
tryCatch({
model_lda <- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count,
data = train_features,
method = "lda",
trControl = train_control,
preProcess = c("center", "scale")
)
# Predictions
lda_pred <- predict(model_lda, test_features)
lda_pred_prob <- predict(model_lda, test_features, type = "prob")
# Confusion Matrix
lda_cm <- confusionMatrix(lda_pred, test_features$price_category)
print(lda_cm)
# Store results
lda_accuracy <- lda_cm$overall['Accuracy']
cat("\nLDA Test Accuracy:", round(lda_accuracy, 4), "\n")
cat("Note: LDA uses numeric features only to avoid collinearity issues.\n")
}, error = function(e) {
cat("\nLDA model failed due to collinearity issues. Using Naive Bayes as alternative.\n")
cat("Error message:", conditionMessage(e), "\n")
# Use Naive Bayes as a simpler alternative
model_lda <<- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count +
room_type,
data = train_features,
method = "naive_bayes",
trControl = train_control
)
lda_pred <<- predict(model_lda, test_features)
lda_pred_prob <<- predict(model_lda, test_features, type = "prob")
lda_cm <<- confusionMatrix(lda_pred, test_features$price_category)
print(lda_cm)
lda_accuracy <<- lda_cm$overall['Accuracy']
cat("\nNaive Bayes (Alternative) Test Accuracy:", round(lda_accuracy, 4), "\n")
})
## Confusion Matrix and Statistics
##
## Reference
## Prediction Budget MidMarket Premium
## Budget 281 109 60
## MidMarket 354 1060 508
## Premium 19 460 1649
##
## Overall Statistics
##
## Accuracy : 0.6644
## 95% CI : (0.6504, 0.6782)
## No Information Rate : 0.4927
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4388
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Budget Class: MidMarket Class: Premium
## Sensitivity 0.42966 0.6507 0.7438
## Specificity 0.95606 0.6998 0.7902
## Pos Pred Value 0.62444 0.5515 0.7749
## Neg Pred Value 0.90790 0.7793 0.7605
## Prevalence 0.14533 0.3620 0.4927
## Detection Rate 0.06244 0.2356 0.3664
## Detection Prevalence 0.10000 0.4271 0.4729
## Balanced Accuracy 0.69286 0.6752 0.7670
##
## LDA Test Accuracy: 0.6644
## Note: LDA uses numeric features only to avoid collinearity issues.
KNN classifies properties based on similarity to their nearest neighbors in feature space.
# Train KNN with expanded k-value testing
set.seed(123)
model_knn <- train(
price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
number_of_reviews + review_scores_rating + availability_365 +
minimum_nights + distance_from_cbd + amenities_count +
property_type + room_type + neighbourhood_cleansed +
host_is_superhost + host_identity_verified + is_popular_area +
property_size + host_experience + availability_level,
data = train_features,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(k = c(3, 5, 7, 9, 11, 15)) # Test 6 different k values
)
# Predictions
knn_pred <- predict(model_knn, test_features)
knn_pred_prob <- predict(model_knn, test_features, type = "prob")
# Confusion Matrix
knn_cm <- confusionMatrix(knn_pred, test_features$price_category)
print(knn_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Budget MidMarket Premium
## Budget 460 148 29
## MidMarket 168 983 447
## Premium 26 498 1741
##
## Overall Statistics
##
## Accuracy : 0.7076
## 95% CI : (0.694, 0.7208)
## No Information Rate : 0.4927
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5149
##
## Mcnemar's Test P-Value : 0.2425
##
## Statistics by Class:
##
## Class: Budget Class: MidMarket Class: Premium
## Sensitivity 0.7034 0.6034 0.7853
## Specificity 0.9540 0.7858 0.7705
## Pos Pred Value 0.7221 0.6151 0.7687
## Neg Pred Value 0.9498 0.7774 0.7870
## Prevalence 0.1453 0.3620 0.4927
## Detection Rate 0.1022 0.2184 0.3869
## Detection Prevalence 0.1416 0.3551 0.5033
## Balanced Accuracy 0.8287 0.6946 0.7779
# Store results
knn_accuracy <- knn_cm$overall['Accuracy']
cat("\nKNN Test Accuracy:", round(knn_accuracy, 4), "\n")
##
## KNN Test Accuracy: 0.7076
## Optimal K: 7
# Compile all model results
# Check if LDA was replaced with Naive Bayes
lda_model_name <- if(exists("model_lda") && model_lda$method == "naive_bayes") {
"Naive Bayes"
} else {
"LDA"
}
model_names <- c("Logistic Regression", "Random Forest", "SVM", lda_model_name, "KNN")
confusion_matrices <- list(logit_cm, rf_cm, svm_cm, lda_cm, knn_cm)
# Extract metrics for each model
metrics_df <- data.frame(
Model = model_names,
Accuracy = sapply(confusion_matrices, function(cm) cm$overall['Accuracy']),
Kappa = sapply(confusion_matrices, function(cm) cm$overall['Kappa']),
Sensitivity_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Sensitivity']),
Specificity_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Specificity']),
Precision_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Pos Pred Value']),
F1_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'F1']),
Sensitivity_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Sensitivity']),
Specificity_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Specificity']),
Precision_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Pos Pred Value']),
F1_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'F1']),
Sensitivity_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Sensitivity']),
Specificity_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Specificity']),
Precision_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Pos Pred Value']),
F1_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'F1'])
)
# Display comprehensive metrics table
print(kable(metrics_df, digits = 4, caption = "Comprehensive Model Performance Metrics"))
##
##
## Table: Comprehensive Model Performance Metrics
##
## |Model | Accuracy| Kappa| Sensitivity_Budget| Specificity_Budget| Precision_Budget| F1_Budget| Sensitivity_MidMarket| Specificity_MidMarket| Precision_MidMarket| F1_MidMarket| Sensitivity_Premium| Specificity_Premium| Precision_Premium| F1_Premium|
## |:-------------------|--------:|------:|------------------:|------------------:|----------------:|---------:|---------------------:|---------------------:|-------------------:|------------:|-------------------:|-------------------:|-----------------:|----------:|
## |Logistic Regression | 0.7418| 0.5725| 0.7676| 0.9592| 0.7618| 0.7647| 0.6415| 0.8140| 0.6618| 0.6515| 0.8078| 0.7937| 0.7918| 0.7997|
## |Random Forest | 0.7718| 0.6229| 0.7875| 0.9641| 0.7887| 0.7881| 0.6967| 0.8265| 0.6950| 0.6959| 0.8223| 0.8287| 0.8234| 0.8228|
## |SVM | 0.7456| 0.5787| 0.7462| 0.9566| 0.7450| 0.7456| 0.6483| 0.8161| 0.6667| 0.6573| 0.8169| 0.8029| 0.8010| 0.8088|
## |LDA | 0.6644| 0.4388| 0.4297| 0.9561| 0.6244| 0.5091| 0.6507| 0.6998| 0.5515| 0.5970| 0.7438| 0.7902| 0.7749| 0.7590|
## |KNN | 0.7076| 0.5149| 0.7034| 0.9540| 0.7221| 0.7126| 0.6034| 0.7858| 0.6151| 0.6092| 0.7853| 0.7705| 0.7687| 0.7769|
# Calculate macro-averaged metrics
metrics_df$Macro_F1 <- rowMeans(cbind(metrics_df$F1_Budget,
metrics_df$F1_MidMarket,
metrics_df$F1_Premium), na.rm = TRUE)
# Overall performance visualization
p1 <- ggplot(metrics_df, aes(x = reorder(Model, Accuracy), y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.5, size = 3.5) +
coord_flip() +
labs(title = "Model Accuracy Comparison",
x = "Model", y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none") +
ylim(0, 1)
p2 <- ggplot(metrics_df, aes(x = reorder(Model, Macro_F1), y = Macro_F1, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Macro_F1, 3)), vjust = -0.5, size = 3.5) +
coord_flip() +
labs(title = "Macro-Averaged F1 Score Comparison",
x = "Model", y = "Macro F1") +
theme_minimal() +
theme(legend.position = "none") +
ylim(0, 1)
grid.arrange(p1, p2, ncol = 2)
# Class-specific performance visualization
f1_scores <- data.frame(
Model = rep(model_names, 3),
Category = rep(c("Budget", "MidMarket", "Premium"), each = 5),
F1_Score = c(metrics_df$F1_Budget, metrics_df$F1_MidMarket, metrics_df$F1_Premium)
)
p3 <- ggplot(f1_scores, aes(x = Model, y = F1_Score, fill = Category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "F1 Scores by Price Category",
x = "Model", y = "F1 Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c"))
print(p3)
library(cvms)
library(tibble)
# Function to create confusion matrix plot
plot_confusion_matrix <- function(cm, title) {
cm_table <- as.data.frame(cm$table)
ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", size = 6, fontface = "bold") +
scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
labs(title = title, x = "Actual Category", y = "Predicted Category") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}
# Create confusion matrix plots for all models
cm1 <- plot_confusion_matrix(logit_cm, "Logistic Regression")
cm2 <- plot_confusion_matrix(rf_cm, "Random Forest")
cm3 <- plot_confusion_matrix(svm_cm, "SVM")
cm4 <- plot_confusion_matrix(lda_cm, "LDA")
cm5 <- plot_confusion_matrix(knn_cm, "KNN")
grid.arrange(cm1, cm2, cm3, cm4, cm5, ncol = 2)
ROC curves provide insight into the trade-off between sensitivity (true positive rate) and the false positive rate (1 - specificity) for each price category.
library(pROC)
library(ggplot2)
# Function to calculate ROC for each class in multi-class problem
calculate_multiclass_roc <- function(predictions, actual, model_name) {
roc_list <- list()
auc_values <- c()
# One-vs-Rest approach for each class
classes <- levels(actual)
for(class in classes) {
# Create binary outcome: current class vs all others
binary_actual <- ifelse(actual == class, 1, 0)
class_prob <- predictions[, class]
# Calculate ROC
roc_obj <- roc(binary_actual, class_prob, quiet = TRUE)
roc_list[[class]] <- roc_obj
auc_values <- c(auc_values, auc(roc_obj))
}
return(list(roc_list = roc_list, auc_values = auc_values, classes = classes))
}
# Calculate ROC for all models
roc_logit <- calculate_multiclass_roc(logit_pred_prob, test_features$price_category, "Logistic Regression")
roc_rf <- calculate_multiclass_roc(rf_pred_prob, test_features$price_category, "Random Forest")
roc_svm <- calculate_multiclass_roc(svm_pred_prob, test_features$price_category, "SVM")
roc_lda <- calculate_multiclass_roc(lda_pred_prob, test_features$price_category, "LDA")
roc_knn <- calculate_multiclass_roc(knn_pred_prob, test_features$price_category, "KNN")
# Create ROC curve plot for each model
plot_roc_model <- function(roc_data, model_name) {
plot_data <- data.frame()
for(i in 1:length(roc_data$classes)) {
class <- roc_data$classes[i]
roc_obj <- roc_data$roc_list[[class]]
auc_val <- roc_data$auc_values[i]
temp_df <- data.frame(
Specificity = 1 - roc_obj$specificities,
Sensitivity = roc_obj$sensitivities,
Class = paste0(class, " (AUC=", round(auc_val, 3), ")")
)
plot_data <- rbind(plot_data, temp_df)
}
ggplot(plot_data, aes(x = Specificity, y = Sensitivity, color = Class)) +
geom_line(size = 1) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
labs(title = paste("ROC Curves -", model_name),
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") +
theme_minimal() +
theme(legend.position = "bottom") +
coord_equal() +
xlim(0, 1) + ylim(0, 1)
}
# Create plots for all models
p_roc1 <- plot_roc_model(roc_logit, "Logistic Regression")
p_roc2 <- plot_roc_model(roc_rf, "Random Forest")
p_roc3 <- plot_roc_model(roc_svm, "SVM")
p_roc4 <- plot_roc_model(roc_lda, lda_model_name)
p_roc5 <- plot_roc_model(roc_knn, "KNN")
grid.arrange(p_roc1, p_roc2, p_roc3, p_roc4, p_roc5, ncol = 2)
# Summary table of AUC values
auc_summary <- data.frame(
Model = c("Logistic Regression", "Random Forest", "SVM", lda_model_name, "KNN"),
AUC_Budget = c(roc_logit$auc_values[1], roc_rf$auc_values[1], roc_svm$auc_values[1],
roc_lda$auc_values[1], roc_knn$auc_values[1]),
AUC_MidMarket = c(roc_logit$auc_values[2], roc_rf$auc_values[2], roc_svm$auc_values[2],
roc_lda$auc_values[2], roc_knn$auc_values[2]),
AUC_Premium = c(roc_logit$auc_values[3], roc_rf$auc_values[3], roc_svm$auc_values[3],
roc_lda$auc_values[3], roc_knn$auc_values[3])
)
auc_summary$Mean_AUC <- rowMeans(auc_summary[, 2:4])
print(kable(auc_summary, digits = 4, caption = "AUC Values by Model and Price Category"))
##
##
## Table: AUC Values by Model and Price Category
##
## |Model | AUC_Budget| AUC_MidMarket| AUC_Premium| Mean_AUC|
## |:-------------------|----------:|-------------:|-----------:|--------:|
## |Logistic Regression | 0.9584| 0.8189| 0.8901| 0.8891|
## |Random Forest | 0.9676| 0.8522| 0.9108| 0.9102|
## |SVM | 0.9579| 0.8224| 0.8960| 0.8921|
## |LDA | 0.8935| 0.7582| 0.8427| 0.8315|
## |KNN | 0.9227| 0.7722| 0.8574| 0.8508|
##
## ROC Curve Interpretation:
## - AUC = 1.0: Perfect classification
## - AUC = 0.5: Random guessing (diagonal line)
## - AUC > 0.8: Generally considered excellent
## - AUC 0.7-0.8: Good classification performance
# Identify best model
best_model_idx <- which.max(metrics_df$Accuracy)
best_model_name <- metrics_df$Model[best_model_idx]
best_accuracy <- metrics_df$Accuracy[best_model_idx]
cat("\n========================================\n")
##
## ========================================
## BEST MODEL: Random Forest
## Test Accuracy: 0.7718
## Macro F1 Score: 0.7689
## ========================================
## Class-Specific Performance:
## Budget:
## - Sensitivity (Recall): 0.7875
## - Precision: 0.7887
## - F1 Score: 0.7881
## MidMarket:
## - Sensitivity (Recall): 0.6967
## - Precision: 0.695
## - F1 Score: 0.6959
## Premium:
## - Sensitivity (Recall): 0.8223
## - Precision: 0.8234
## - F1 Score: 0.8228
##
## Key Insights:
cat("- Four of the five models achieved >70% accuracy, demonstrating that Airbnb pricing patterns are learnable\n")
## - Four of the five models achieved >70% accuracy, demonstrating that Airbnb pricing patterns are learnable
cat("- Random Forest likely performs best due to ability to capture non-linear feature interactions\n")
## - Random Forest likely performs best due to ability to capture non-linear feature interactions
cat("- Geographic features (distance_from_cbd, neighbourhood) appear critical for classification\n")
## - Geographic features (distance_from_cbd, neighbourhood) appear critical for classification
## - Property characteristics (bedrooms, accommodates) strongly differentiate price tiers
## - MidMarket category may be hardest to classify due to overlap with adjacent categories
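The Macro F1 Score reported above averages the three per-class F1 scores with equal weight per class, regardless of class size. A minimal sketch of that computation from a confusion matrix (the matrix counts below are illustrative, not the Random Forest's actual predictions):

```r
# Macro F1: per-class precision/recall from a confusion matrix
# (rows = predicted, cols = actual), then the unweighted mean of per-class F1.
macro_f1 <- function(conf_mat) {
  f1_per_class <- sapply(seq_len(nrow(conf_mat)), function(i) {
    precision <- conf_mat[i, i] / sum(conf_mat[i, ])  # TP / predicted positives
    recall    <- conf_mat[i, i] / sum(conf_mat[, i])  # TP / actual positives
    2 * precision * recall / (precision + recall)
  })
  mean(f1_per_class)
}

# Illustrative 3x3 confusion matrix for the three price categories
cm <- matrix(c(80, 10,  2,
               12, 70, 15,
                3, 18, 90),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred   = c("Budget", "MidMarket", "Premium"),
                             actual = c("Budget", "MidMarket", "Premium")))
macro_f1(cm)  # approximately 0.80
```

Because each class contributes equally, a weak MidMarket F1 (the hardest category above) pulls the macro average down even when the larger classes are predicted well.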
This analysis establishes a robust foundation for understanding Sydney’s short-term rental market through the lens of data science. Our comprehensive examination of 15,000+ Airbnb properties reveals clear market segmentation patterns that reflect broader Australian housing dynamics.
Key Findings
Market Structure: Sydney’s accommodation market demonstrates distinct pricing tiers, with premium properties concentrated in iconic locations. The data reveals that location, amenities, and host quality are primary drivers of pricing power.
Data Quality: Through systematic cleaning and feature engineering, we transformed raw listing data into a modeling-ready dataset with 19 carefully selected features. Missing data patterns were strategically addressed using domain knowledge, achieving 100% data completeness.
Geographic Insights: Distance from Sydney’s CBD emerges as a critical pricing factor, while neighborhood-specific patterns highlight the premium commanded by waterfront and central locations.
Inside Airbnb. (2025). Sydney, New South Wales, Australia Dataset. Retrieved from http://insideairbnb.com/get-the-data/
Cox, M. (2024). Inside Airbnb: Adding Data to the Debate. Retrieved from http://insideairbnb.com/about.html
Australian Bureau of Statistics. (2023). Housing Occupancy and Costs. Retrieved from https://www.abs.gov.au/
NSW Government. (2024). Short-term Rental Accommodation Industry in NSW. Retrieved from https://www.nsw.gov.au/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.
Dhummad, S. (2025). The Imperative of Exploratory Data Analysis in Machine Learning. Scholars Journal of Engineering and Technology, 13.
Katyal, A., Sharma, P. K., & Kannan, M. (2025). Exploratory Data Analysis (EDA) on Undergraduate Data Science Students Through R Programming.
Michelucci, U. (2025). Data Visualisation. In Statistics for Scientists: A Concise Guide for Data-driven Research (pp. 109-119). Cham: Springer Nature Switzerland.
# Creating a comprehensive data dictionary
data_dict <- data.frame(
Variable = c("price_category", "accommodates", "bedrooms", "bathrooms",
"property_type", "room_type", "neighbourhood_cleansed",
"latitude", "longitude", "host_is_superhost", "host_response_rate",
"host_listings_count", "review_scores_rating", "number_of_reviews",
"availability_365", "minimum_nights", "amenities_count",
"distance_from_cbd", "is_popular_area", "property_size"),
Type = c("Categorical", "Numeric", "Numeric", "Numeric",
"Categorical", "Categorical", "Categorical",
"Numeric", "Numeric", "Logical", "Numeric",
"Numeric", "Numeric", "Numeric",
"Numeric", "Numeric", "Numeric",
"Numeric", "Logical", "Categorical"),
Description = c("Target variable: Budget (<$100), Mid-Market ($100-200), Premium (>$200)",
"Maximum number of guests property can accommodate",
"Number of bedrooms available",
"Number of bathrooms available",
"Type of property (Apartment, House, etc.)",
"Type of rental (Entire home, Private room, Shared room)",
"Sydney neighbourhood/suburb name",
"Geographic latitude coordinate",
"Geographic longitude coordinate",
"Whether host has Superhost status",
"Host response rate as proportion (0-1)",
"Number of listings managed by host",
"Average review score rating (1-5 scale)",
"Total number of reviews received",
"Days available for booking per year",
"Minimum nights required for booking",
"Number of amenities provided",
"Calculated distance from Sydney CBD",
"Whether in popular tourist area",
"Property size category based on capacity")
)
kable(data_dict, caption = "Complete Data Dictionary for Model Features")
| Variable | Type | Description |
|---|---|---|
| price_category | Categorical | Target variable: Budget (<$100), Mid-Market ($100-200), Premium (>$200) |
| accommodates | Numeric | Maximum number of guests property can accommodate |
| bedrooms | Numeric | Number of bedrooms available |
| bathrooms | Numeric | Number of bathrooms available |
| property_type | Categorical | Type of property (Apartment, House, etc.) |
| room_type | Categorical | Type of rental (Entire home, Private room, Shared room) |
| neighbourhood_cleansed | Categorical | Sydney neighbourhood/suburb name |
| latitude | Numeric | Geographic latitude coordinate |
| longitude | Numeric | Geographic longitude coordinate |
| host_is_superhost | Logical | Whether host has Superhost status |
| host_response_rate | Numeric | Host response rate as proportion (0-1) |
| host_listings_count | Numeric | Number of listings managed by host |
| review_scores_rating | Numeric | Average review score rating (1-5 scale) |
| number_of_reviews | Numeric | Total number of reviews received |
| availability_365 | Numeric | Days available for booking per year |
| minimum_nights | Numeric | Minimum nights required for booking |
| amenities_count | Numeric | Number of amenities provided |
| distance_from_cbd | Numeric | Calculated distance from Sydney CBD |
| is_popular_area | Logical | Whether in popular tourist area |
| property_size | Categorical | Property size category based on capacity |
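The distance_from_cbd feature in the dictionary above is a derived variable. One way such a distance could be computed from the latitude/longitude columns is the haversine great-circle formula; the sketch below assumes approximate CBD coordinates, and the Bondi example point is illustrative:

```r
# Haversine great-circle distance (km) between two coordinate pairs.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Approximate coordinates of Sydney's CBD
cbd_lat <- -33.8688
cbd_lon <- 151.2093

# e.g. a listing near Bondi Beach sits roughly 6-7 km from the CBD
haversine_km(-33.8908, 151.2743, cbd_lat, cbd_lon)
```

Applied row-wise over the latitude/longitude columns, this yields a distance in kilometres for every listing.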
Geographic Analysis
# Geographic distribution
ggplot(listings, aes(x = longitude, y = latitude, color = price_category)) +
geom_point(alpha = 0.6, size = 0.8) +
labs(title = "Geographic Distribution of Properties by Price Category",
subtitle = "Sydney Airbnb listings colored by price segment",
x = "Longitude", y = "Latitude") +
theme_minimal() +
scale_color_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1)))
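The clustering visible in the scatter plot can also be quantified by tabulating price categories inside and outside a CBD radius. A sketch with dplyr, using a simulated `demo` data frame as a stand-in for the listings data (the 5 km threshold and the simulated values are illustrative only):

```r
library(dplyr)

# Simulated stand-in for the listings data frame (illustrative only)
set.seed(42)
demo <- data.frame(
  distance_from_cbd = runif(1000, 0, 25),
  price_category    = sample(c("Budget", "MidMarket", "Premium"), 1000,
                             replace = TRUE)
)

# Share of each price category within 5 km of the CBD versus beyond
demo %>%
  mutate(cbd_zone = ifelse(distance_from_cbd <= 5, "Within 5 km", "Beyond 5 km")) %>%
  count(cbd_zone, price_category) %>%
  group_by(cbd_zone) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

On the real data, a Premium share that is markedly higher inside the radius would confirm the concentration around the CBD and harbour seen in the plot.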
Neighbourhood Analysis
# Top neighbourhoods by count
top_neighbourhoods <- listings %>%
count(neighbourhood_cleansed, sort = TRUE) %>%
head(15)
p7 <- ggplot(top_neighbourhoods, aes(x = reorder(neighbourhood_cleansed, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Sydney Neighbourhoods by Listing Count",
x = "Neighbourhood", y = "Number of Listings") +
theme_minimal()
# Median price by neighbourhood
neighbourhood_price <- listings %>%
filter(neighbourhood_cleansed %in% top_neighbourhoods$neighbourhood_cleansed) %>%
group_by(neighbourhood_cleansed) %>%
summarise(
count = n(),
median_price = median(price_numeric),
premium_pct = mean(price_category == "Premium") * 100
) %>%
arrange(desc(median_price))
p8 <- ggplot(neighbourhood_price, aes(x = reorder(neighbourhood_cleansed, median_price),
y = median_price)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(title = "Median Price by Neighbourhood",
subtitle = "Top 15 neighbourhoods by listing count",
x = "Neighbourhood", y = "Median Price (AUD)") +
theme_minimal() +
scale_y_continuous(labels = dollar_format(prefix = "$"))
grid.arrange(p7, p8, ncol = 1)
This analysis was conducted as part of STAT5003 Computational Statistical Methods coursework, focusing on real-world application of machine learning techniques to Australian housing market data. The report has been prepared with the assistance of artificial intelligence (AI) tools. AI was used to support tasks such as research support, grammar correction and clarity improvement. All content has been reviewed and verified by the team to ensure accuracy, relevance and alignment with project objectives.